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Abstract 

Background: One segment of a RNA sequence might be paired with another segment of the same RNA sequence 
due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary 
structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These 
algorithms are referred to as RNA folding algorithms. 

Results: We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. 

Conclusions: Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive 
straightforward single core code. The multicore version of the cache efficient single core algorithm provides a 
speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our 
GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 
5.1 and 1 1.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. 



Background 

An RNA sequence is a chain of nucleotides from the 
alphabet {G (guanine), A (adenine), U (uracil), C (cyto- 
sine)}. One segment of a RNA sequence might be paired 
with another segment of the same RNA sequence due to 
the force of hydrogen bonds. This two-dimensional struc- 
ture is called the RNA sequence's secondary structure. 
Two nucleotides in an RNA sequence can form Watson- 
Crick AU and GC base pairs as well as the GU wobble 
pair. Several algorithms have been proposed to predict an 
RNA sequence's secondary structure. These algorithms 
are referred to as RNA folding algorithms. Waterman and 
Smith [1] and Nussinov et al. [2] made the first attempt in 
1978. Zuker et al. [3] refined Nussinov's algorithm by 
using a thermodynamic energy minimization model, 
which produces more accurate results at the expense of 
greater computational complexity. Although our focus in 
this paper is the simpler Nussinov's algorithm, our strate- 
gies may be applied to Zuker's algorithm as well. 
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The complexity of Nussinov's and Zuker's algorithm is 
0{n ), where n is the length of the RNA sequence to be 
folded. Other RNA folding algorithms with better predic- 
tion properties and higher complexity also exist. When 
folding viral sequences, n ranges from several thousand 
to several million and single-core run time becomes 
excessive and so much effort has gone into developing 
parallel versions of various RNA folding algorithms. 
For example, [4,5] develop a multicore code for an 0{n ) 
folding algorithm while [6] does this for Nussinov's 
algorithm. [7] develops a framework for RNA folding 
algorithms on a cluster and tests this framework using an 
Oin^) (Pknots-RE) and an 0(«*) (Pknots-RG) algorithms 
for the prediction of RNA secondary structure. FPGA 
solutions for secondary structure prediction have been 
proposed in [8-10] and GPU solutions in [11,12]. We 
note that [11] is based on the algorithm of Zuker et al. 
[3] while [12] is based on that of Nussinov et al. [2]. 

We start in this paper by describing the GPU architec- 
ture and programming model. Then we state Nussinov 
et al.'s [2] dynamic programming recurrence for secondary 
structure prediction and we give modifications of these 
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equations as obtained by Chang et al. [12]. These modifi- 
cations simpUfy the parallehzation of the original equa- 
tions and compute the same results. We also describe the 
strategy used in [12] to obtain a GPU algorithm to solve 
the modified equations. A naive implementation of the 
modified equations of [12] together with a cache efficient 
version and multicore versions are given. We note 
that although [6] gives a multicore version of Nussinov's 
algorithm, our multicore version is much simpler and 
provides similar speedup. Then our GPU algorithm for 
the modified equations is described. Experimental and 
benchmark results are presented after that. 

GPU architecture and programming model 

Our work targets the NVIDIA C2050 GPU. Figure 1 
shows the architecture of one streaming multiprocessor 
(SM) of the NVIDIA Fermi line of GPUs of which the 
C2050 is a member. The C2050 comprises 448 processor 
cores grouped into 14 SMs with 32 cores per SM. Each 
SM has 64KB of shared memory/Ll cache that may be 



set up as either 48KB of shared memory and 16KB of LI 
cache or 16KB of shared memory and 48KB of LI cache. 
In addition, each SM has 32K registers. The 14 SMs 
access a common 3GB of DRAM memory, called device 
or global memory, via a 768KB L2 cache. A C2050 is 
capable of performing up to 1.288 TFLOPS of single- 
precision operations and 515 GFLOPS of double preci- 
sion operations. A C2050 connects to the host processor 
via a PCI-Express bus. The master-slave programming 
model in which one writes a program for the host or 
master computer and this program invokes kernels that 
execute on the GPU is supported. GPUs use the SIMT 
(single instruction multiple thread) programming model 
in which the GPU accomplishes a computational task 
using thousands of light weight threads. The threads 
are grouped into blocks and the blocks are organized as 
a grid. While a block of threads may be 1-, 2-, or 
3-dimensional, the grid of blocks may only be 1 or 
2-dimensional. The key challenge in deriving high 
performance on this machine is to be able to effectively 
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minimize the memory traffic between the SMs and the 
global memory of the GPU. This effectively requires 
design of novel algorithmic and implementation app- 
roaches and is the main focus of this paper. 

Nussinov's dynamic programing recurrence 

Let S = aia2—a„ be an RNA sequence where a, e {A, C, 
G, L/} is a nucleotide. Nussinov's algorithm finds the most 
possible secondary structure by maximizing the number of 
bonded pairs (AU, GC or GU). Let C(i,j) be the maximum 
number of bonded pairs in the subsequence a, ... Uj, 1 < i < 
j < n. Nussinov's dynamic programming recurrence for C 
is given below. 



C(i, !-l) = 0 2<i<n 
C[i, i) = 0 1 <i<n 



C[i,i) = max 



C(i + 1,7) 
C(M-l) 

C{i + l,j — 1) + hond{ai, Uj) 
maXi<;,<j{C(i, k) + C{k + 1, j)} 



Here, bond(flj, aj ) is 1 if («;, Uj ) is an AU, GC or GU 
pair and 0 otherwise, and the third equation applies 
when / </. The third equation computes the maximum 
of four terms that have the following significance. 

1 Add unpaired a, to the best structure for subse- 
quence ai+i-Mj , as shown in Figure 2(a). 

2 Add unpaired aj to the best structure for subse- 
quence ai-Mj^i, as shown in Figure 2(b). 

3 Add (ui, Uj ) pair to the best structure for subse- 
quence <a;,+i. ..<?;_!, as shown in Figure 2(c). 

4 Combine two best structures for and a/^+i...aj, 
as shown in Figure 2(d). 

Once the C{i, j), 1 < i <j < n, have been computed, a 
traceback procedure can be used to construct the actual 
secondary structure, which is represented in the matrix 



as a path leading to the maximum score. While this 
traceback procedure takes 0{n) time, the actual compu- 
tation of the C matrix takes 0{n^) time. 

Simplified recurrence and GPU algorithm 

Chang et al. [12] simplified Nussinov's recurrence to the 
following. 



C(i, i) = 0 forl<i<n 
C(i,!+1) = 0 forl<i<n-l 



max 



C(i + l,j — 1) + bond{ai, Uj) 
maXi<k<j{C{i, k) + C(j', k+ I)] 



(1) 
(2) 

(3) 



Like Nussinov's original recurrence, the simplified 
recurrence uses 0{n ) memory and 0{n ) time. However, 
Chang's formulation is easier to parallelize. In a serial 
computation of C, we would start by initializing C{i, i) 
(the main diagonal of C) and C(i, i + 1) (the diagonal just 
above the main one) using Equations 1 and 2 and then 
use Equation 3 to compute the remaining diagonals in 
the order C{i, i + 2) ... C{i, i + n - 1). Figure 3(a) shows 
how the computation progresses. 

In [12], the entire computation is divided into three 
stages as shown in Figure 3(b), namely the initial stage, 
the middle stage and the final stage. In the initial stage, 
since the computation at each element is shallow (the 
distance to the central diagonal is short), one GPU thread 
is assigned to compute one element. No data exchange 
between threads is needed. All threads synchronize before 
moving to the next diagonal. In the middle stage, an 
entire block of threads is assigned to compute one 
element and the parallel reduction method contained in 
CUDA SDK is used. In the final stage, all SMs collabo- 
rate to compute one element because the distance from 
the element to the central diagonal is long and the 
computation for each element is heavy. Again, parallel 
reduction is used in this stage. To reduce accesses to 
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device memory, the GPU algorithm of [12] stores each 
C(i, j) value, i <j in positions {i, j) as well as in the 
otherwise unused position (/, When C(j, ^ + 1) is to 
be read from device memory (i.e., when it is needed in 
the right side of Equation 3), the read is done from 
position {k + 1, /). This changes column reads to row 
reads and better utilizes the L2 and LI caches of the 
target GPU. 

Methods 

Cache efficient and multlcore algorithms 

CPUl (Algorithm 1) is a naive single-core algorithm to 
compute C using the simplified recurrence of Chang et al. 
This algorithm computes M [i][j] = C(i+1,/+1), 0 < i < J 
<n, where n is the length of the RNA sequence R. 

Algorithm 1 CPUl 

Require: RNA sequence R 

Ensure: Array M such that M [i] \j] = C{i + 1, / + 1) 
1: for i<- Oto \R\-1 do 
2: M[i][i]<^0 
3: end for 

4: for i<- 0 to |7?|-2 do 
5: M[i][i +1]<^0 
6: end for 

7: for diag<r- 2 to do 

8: for row <— 0 to \R\-diag-l do 

9: col <— diag + row 

10: a <- R[row]; b <- R[col] 

11: max <— M [row + 1] [col - 1] + bond{a, b) 

12: for k <— row to col-1 do 

13: t<^ M [row] [k] + M [k + I] [col] 

14: max <— max{max, t} 

15: end for 

16: M [row][col] <— max 

17: end for 

18: end for 



By using the lower triangle of M to store the transpose of 
the computed values in the upper triangle of M as is done 
in the GPU implementation of [12], we get a cache efficient 
version of CPUl. To arrive at CPU2, we change Line 13 of 
Algorithm 1 to "t <- M [row][k] + M [col][k + 1]", and 
change Line 16 to "M [row] [col] <— M [col] [row] <— max". 
Values of M [k + 1] [col] locate in the same column 
but values of M [col] [k + 1] locate in the same row, for 
row < k < col. Reading values in a row is more cache 
efficient than reading values in a column. 

Multicore versions of CPUl and CPU2, respectively 
labeled OMPl and OMP2, are obtained by inserting 
OpenMP statements to parallelize the for loops of lines 1, 
4, and 8. 

Our GPU algorithm 

Unlike the GPU algorithm of [12] which computes C by 
diagonals, we use a refinement of the block strategy used 
in [11,6]. Figure 4 shows the 20 x 20 C matrix for the case 
of RNA sequences whose length is « = 20. To compute 
the element labeled "X", elements "a" to "1" are, respec- 
tively, added to elements "A" to "L" and the maximum of 
"a+A", "b+B", ... "1+L" is computed. We note that the com- 
putation for "Y" also requires "A" to "L" and that "X" has 
to be computed before "Y" and "Z" can be computed. 

In our block strategy, we partition the upper triangle of 
the C matrix into square blocks except that adjacent to 
the main diagonal the partitioning is into triangles and 
that at the right end is into possibly non-square rectangles. 
Figure 4 shows the partitioning for the case « = 20 using 
4x4 blocks. Notice the triangles adjacent to the main 
diagonal and the 4x1 non-square partitions at the right 
end. The blocks (whether triangular, square, or non- 
square) are indexed left-to-right top-to-bottom beginning 
with (0, 0). In keeping with the traditional way to number 
blocks for GPU computation, the first coordinate increases 
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Figure 4 Block partitioning of C matrix 



as you move to the right and the second as you move 
down. So "X" (Figure 4) resides in (4, 1), "K" in (4, 4), and 
"P" in (3, 2). Bloclcs on the main diagonal are indexed (/, i) 
and are triangular. For the dependencies in Equation 3, it 
follows that blocks that lie on the same diagonal of blocks 
(i.e., blocks with the index (/, / - k) for some fixed k) are 
independent and so may be computed in parallel but that 
elements within a block are to be computed by diagonals. 
Our 20 X 20 example of Figure 4 has 6 diagonals of blocks 
and so six iterations of computation with each iteration 
computing all blocks on the same diagonal in parallel are 
required. 

As noted earlier, the first diagonal of blocks is comprised 
of triangles. To each triangular block, we assign a thread 
block. The threads within the assigned thread block com- 
pute the elements in the triangular block in diagonal order 
with all elements on the same diagonal being computable 
in parallel. Hence, for this computation, the number of 
thread blocks equals the number of triangular blocks. 

Let us turn our attention now to the remaining blocks (i. 
e., the non-triangular blocks). Notice that when we start 



the computation of the elements in, say, block (4, 1), 
where "X" resides, "a" to "j", and "C" to "L" have already 
been computed, because they are on preceding block 
diagonals. But "k", "1", "A", and "B" have yet to be 
computed. The computation of the maximum of "c+C" to 
"}+]" can be done using a kernel maxKernel (described 
later). This kernel uses registers for temporary values and 
writes these temporary values to shared memory upon 
completion. The final value for "O" can be obtained by 
comparing the temporary maximum value in "O" with "P" 
plus the bond value in Equation 3. Then the maximum of 
"r+O", "q" plus its bond value, and the temporary maxi- 
mum value in "m" is written to "m" as its final value. 
Similarly, for "M", the maximum of "O+R", "Q" plus its 
bond value, and the temporary maximum value in "M" is 
written to "M" as its final value. The computations for "m" 
and "M" can be done in parallel. So the computation 
within element block (4, 1) is done in diagonal order. All 
elements on the same diagonal can be computed in 
parallel with all data residing in shared memory. The 
pseudocode is shown as Algorithm 2. 
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Algorithm 2 Our GPU algorithm 
Require: RNA sequence R, blocked diagonal index D, 
block size BS 
Ensure: C matrix 
1: Register[16] reg 
2: Shared_Memory[BS][BS] sA 
3: Shared_Memory[BS + l\[BS + 1] sC 
4: Global_Memory[BS][BS] gB 
5: Global_Memory[BS\[BS\ gC 
6: tx <r- threadldx.x; ty <— threadldx.y 
7: bx <— blockldx.x; by <— blockldx.y 
8: fl7?ow <^ by X BS; aCol <— aTJow - 1 
9: Mow <— flJ?ou'; Z?Co/ <— £> x fiS - 1 + fl7?ow 
10: for blk<^ I to D - I Ao 

11: 5^4 <— the block starting at (aRow, aCol + blk 
X BS) 

12: <- the block starting at {bRow + blk x BS, 

13: maxKemel{sA, gB, reg) 
14: Syncthreads 
15: end for 
16: sC <— rc^ 

17: for <f <- 1 to BS X 2 - 1 do 

18: for Each element e on diagonal d do 

19: Finish remaining computation 

20: end for 

21: Syncthreads 

22: end for 

23: gC <- sC 

Algorithm 3 maxKernel 

Require: Block sA in shared memory, block gB in glo- 
bal memory 
Ensure: Partial result of reduction in reg 
rO <- gB[0][tx\; rl <- gB[l][tx\ 
r2 <r- gB[2][tx]; r3 <r- gB[3][tx] 
for y <— 0 to 6 do 
for <- 0 to 15 do 

reg[k] <— max{re^[/:], rO + sA[ty x 16 + k] 

y X 4]} 

end for 

^0 <- gBlij + 1) X Mltx] 

II 2 similar for loops for rl and r2 come here 

for A: ^ 0 to 15 do 

reg[k] <— max{re^[^], r3 + sA[ty x 16 + k] 

[/■ X 4 + 3]} 

end for 

r3<^gB[ij + 1) x4 + 3]M 
end for 

for A: ^ 0 to 15 do 

reg[k] <- max{re^[^], rO + sA[ty x 16 + A:] [28]} 
end for 

112 similar for loops for rl and r2 come here 
for k <- 0 to 15 do 

reg[k] <- max{re^[Ar], r3 + sA[ty x 16 + A:] [31]} 
end for 



Description of maxKernel 

The computation of the maximum of "c+C" to "j+J" 
(Figure 4) bears some resemblance to the computation of 
a term in a matrix multiply. So, we can adapt the ideas 
used in matrix multiply kernels to arrive at an efficient 
kernel to find the desired maximum of the sum of pairs. 
In our case (Algorithm 3), we adapt the GPU matrix mul- 
tiply kernel of Volkov and Demmel [13]. The element 
block size used in our implementation is 32 x 32 and a 
thread block is configured as 32 x 2. Each thread com- 
putes 16 elements that lie in the same column as shown in 
Figure 5 (this figure shows only six threads as arrows 
above block C). The 16 elements computed by one thread 
are represented as a slim gray bar in block C. The gray 
area in block A depicts the data needed by the first 
32 threads. This data will be read into shared memory. To 
achieve high throughput from/to device memory, we use 
coalesced memory accesses in which all data accessed by 
one warp (this is the minimum scheduling unit and it con- 
tains 32 threads) falls in the same device memory cache 
line of size 128 bytes. In Figure 5, six threads fetch the 
first row from the gray area of block B. Then each thread 
uses the value just fetched to add with the first column in 
the gray area of block A (which is already read into shared 
memory). In other words, thread i will add B[0] [i] with A 
\j][0] (0 < / < 16) and compare this value with register\j] of 
thread i and update register[J] if necessary. Then B[l][i] is 
added with Alj] [1] and the result is compared with regis- 
ter\j]; the register is updated as needed, 0 < ; < 16. Since 
threads in the same warp will read data in the same row of 
block B, this reading is coalesced and serviced at the 
throughput of LI or L2 cache in case of a cache hit, or at 
the throughput of device memory, otherwise. Besides, all 
threads in the same warp use the same value from block 
A, which resides in shared memory. This value can be 
broadcast to all threads in the same warp. 

We note that [11] also employs a maxKernel but their 
kernel is different from ours. 

Results 

We benchmarked our algorithms using a PC with a 
hyperthreaded 6-core Intel i7x980 3.33GHz CPU and 
12GB DDR3 RAM. The PC also had NVIDIA Tesla 
C2050 GPU. Since only two threads may be scheduled 
per 17 core at any time, a maximum of 12 threads may 
be gainfully used. We used randomly generated RNA 
sequences in our experiments. Since the run time of our 
codes is relatively insensitive to the actual RNA 
sequence in use due to the fact that the entire computa- 
tion is to fill out an « x « matrix, our use of random 
sequences does not materially impact our conclusions. 

Single and multlcore algorithms 

In both the codes OMPl and 0MP2, the work assigned 
to the threads is well balanced by OpenMP and so best 



Li et al. BMC Bioinformatics 2014, 15(Suppl 8):S1 
http://www.biomedcentral.com/1471-2105/15/S8/S1 



Page 7 of 9 



16 



32 




B 
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performance is expected using either 6 or 12 threads. 
Our experiments confirmed this expectation with the use 
of 6 threads generally being better than the use of 
12 threads. So, for our application the overhead of con- 
text switching between the two threads assigned to a 
core when a total of 6 threads are used generally 
exceeded the gains obtainable from having a second 
thread ready in case the first thread stalls from memory 
delay. Table 1 gives the run times for our algorithms 
CPUl, CPU2, OMPl, and OMP2 for n values ranging 
from 3000 to 16000. The columns labeled ratio give the 
ratios CPUl/OMPl and CPU2/OMP2, respectively. 
Although we have 6 cores on our CPU, we are able to 
achieve speedups of almost 5 from the multicore 
versions. By comparison, the far more complex multicore 
code of [6], which uses a blocking strategy similar to that 
used by our GPU code, achieves a simulated speedup of 
6.3 with 4 threads. The speedup reported in [6] is 
referred to as "simulated speedup" because it comes from 
the use of a multicore simulator rather than from actual 
speedup measurements on a real muticore computer. 
However, this simulated speedup ignores several factors 
such as synchronization overhead that will reduce 
speedup in a real environment. Further, the simulated 
speedup of 6.3 is relative to the equivalent of the code 
CPUl. The speedup achieved by OMP2 relative to CPUl 



is between 7.5 and 14.0! We note also that the speedup 
obtained solely from the use of the caching strategy (i.e., 
the ratio CPU1ICPU2) ranges from 1.6 to 3.0. 

GPU algorithms 

We experimented with three versions of our GPU algo- 
rithm. The first is called Oursl, which is as described in 



Table 1 Running time (seconds) of different CPU 
algorithms 



n 


CPUl 


OMPl 


Ratio 


CPUl 


0MP2 


Ratio 


3000 


35.9 


71 


5.1 


22.3 


4.8 


4.6 


4000 


98.1 


18.6 


5.3 


52.8 


11.3 


4.7 


5000 


208.1 


41.6 


5.0 


102.9 


22.2 


4.6 


6000 


363.7 


72.2 


5.0 


177.5 


45.3 


3.9 


7000 


646.1 


1252 


52 


281.3 


61.0 


4.6 


8000 


9244 


197.8 


4.7 


419.6 


92.5 


4.5 


9000 


1461.5 


291.0 


5.0 


596.6 


129.9 


4.6 


10000 


19277 


395.0 


4.9 


819.1 


176.9 


4.6 


11000 


2800.8 


5592 


5.0 


1 088.4 


234.5 


4.6 


12000 


35252 


7414 


4.8 


1413.6 


303.3 


4.7 


13000 


4852.3 


978.8 


5.0 


1 795.4 


3884 


4.6 


14000 


6008.9 


12502 


4.8 


2243.5 


485.2 


4.6 


15000 


7930.0 


16414 


4.8 


2757.3 


594.0 


4.6 


16000 


10120.0 


2380.8 


4.3 


3343.5 


7254 


4.6 
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Table 2 Running time (seconds) of different GPU 
algorithms 



n 


Oursl 


Oursl 


[12] 


OursR 


Rot/ol 


Ratio! 


2000 


0.1 


0.1 


0.3 


0.1 


3.0 


1.0 


6000 


0.6 


0.7 


4.0 


0.8 


6.7 


1.3 


10000 


1.9 


2.2 


164 


3.2 


8.6 


1.7 


14000 


4.5 


5.1 


43.0 


7.9 


9.6 


1.8 


18000 


8.8 


9.9 


89.5 


16.0 


10.2 


1.8 


22000 


15.1 


16.9 


161.7 


28.2 


10.7 


1.9 


26000 


23.9 


26.7 


266.3 


45.8 


11.1 


1.9 


37000 




71.5 











Our GPU algorithm section. In the second version, which 
is called Ours2, device memory usage is reduced by half by 
storing only the upper triangle of the output matrix. This 
upper triangle is mapped into a onedimensional array 
using the standard row-major mapping. Since this version 
uses only half the device memory used by the other 
versions, it may be used on larger instances. In the third 
version, which is called OursR, we replaced our maxKernel 
with the kernel described in [11]. Since we were unable to 
get the GPU code of [11], the kernel used by us was actu- 
ally one we wrote based on the description provided in 
[11]. These three codes were benchmarked against each 
other as well as against the GPU Nussinov code of [12]. 
The maximum size of sequence Ours2 can handle is 
37000 while the other versions can handle sequences of 
size up to 26000. Oursl runs slighdy slower than Oursl as 
shown in Table 2. So, Oursl is recommended only when 
the instance size is large enough to make Oursl nonfeasi- 
ble. Table 2 and Figure 6 show the running time for the 



four different GPU codes. Ratiol in Table 2 shows the 
speedup of Oursl relative to [12] ([12]/OMr5l). Ratiol 
shows OursRIOursl. As can be seen, Oursl is up to 1.9 
times as fast as OursR indicating that a corresponding 
speedup could be obtained for Zuker's algorithm by repla- 
cing the maximum finding kernel used in [11] with our 
kernel for this operation. Also, Oursl is between 3.0 and 
11.1 times as fast as the GPU algorithm of [12]. 

Single core vs multicore vs GPU 

Table 3 gives the speedup obtained by Oursl relative to 
CPU2 and OMPl. Using a GPU, we can do the Nussinov 
computations up to 522.6 times as fast as using a cache 
efficient single core code and up to 113.4 times as fast as 
using a 6-core cache efficient code! Compared to the naive 
single-core code CP U 1, our GPU codes provides a 
speedup of up to 1582! 

Conclusions 

We have developed simple and efficient single and multi- 
core algorithms as well as an efficient GPU code for RNA 
folding based on Nussinov's equations [2]. Our cache effi- 
cient single-core algorithm provides a speedup between 
1.6 and 3.0 relative to a naive straightforward single core 
code. The multicore version of the cache efficient single 
core algorithm provides a speedup, relative to the naive 
single core algorithm, between 7.5 and 14.0 on a 6 core 
hyperthreaded CPU. Our GPU algorithm, Oursl, for the 
NVIDIA C2050 is up to 1582 times as fast as the naive 
single core algorithm and between 3.0 and 11.1 times as 
fast as the fastest previously known GPU algorithm for 
Nussinov RNA folding. With the available 3GB device 
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Table 3 Speedup of Oursi relative to other versions 



n CPU2 0MP2 n CPU2 0MP2 



3000 


157.0 


33.8 


10000 


424.4 


91.7 


4000 


224.7 


48.1 


11000 


4414 


95.1 


5000 


259.8 


56.1 


12000 


465.9 


100.0 


6000 


302.9 


77.3 


13000 


472.3 


102.2 


7000 


341.8 


74.1 


14000 


496.9 


107.5 


8000 


376.0 


82.9 


15000 


503.3 


1084 


9000 


392.2 


854 


16000 


522.6 


1134 



memory on an NVIDIA GPU, Oursl is able to handle 
sequences of length up to 26000. Sequences of length 
between 26000 and 37000 may be handled using Ours2, 
which uses a onedimensional array mapping of the 
upper triangle of the output matrix rather than a two- 
dimensional array that represents the full output matrix. 
Oursl, however, runs slightly slower than Oursl. Our 
methods can be used to speedup up RNA folding using 
Zuker's equations as well [3,11]. 
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